{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# An Introduction to PCA with MNIST\n",
    "_**Investigating Eigendigits from Principal Components Analysis on Handwritten Digits**_\n",
    "\n",
    "1. [Introduction](#Introduction)\n",
    "2. [Prerequisites and Preprocessing](#Prequisites-and-Preprocessing)\n",
    "  1. [Permissions and environment variables](#Permissions-and-environment-variables)\n",
    "  2. [Data ingestion](#Data-ingestion)\n",
    "  3. [Data inspection](#Data-inspection)\n",
    "  4. [Data conversion](#Data-conversion)\n",
    "3. [Training the PCA model](#Training-the-PCA-model)\n",
    "4. [Set up hosting for the model](#Set-up-hosting-for-the-model)\n",
    "  1. [Import model into hosting](#Import-model-into-hosting)\n",
    "  2. [Create endpoint configuration](#Create-endpoint-configuration)\n",
    "  3. [Create endpoint](#Create-endpoint)\n",
    "5. [Validate the model for use](#Validate-the-model-for-use)\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Introduction\n",
    "\n",
    "Welcome to our example introducing Amazon SageMaker's PCA Algorithm! Today, we're analyzing the [MNIST](http://yann.lecun.com/exdb/mnist/) dataset which consists of images of handwritten digits, from zero to nine.  We'll ignore the true labels for the time being and instead focus on what information we can obtain from the image pixels alone.\n",
    "\n",
    "The method that we'll look at today is called Principal Components Analysis (PCA).  PCA is an unsupervised learning algorithm that attempts to reduce the dimensionality (number of features) within a dataset while still retaining as much information as possible. This is done by finding a new set of feature dimensions called principal components, which are composites of the original features that are uncorrelated with one another. They are also constrained so that the first component accounts for the largest possible variability in the data, the second component the second most variability, and so on.\n",
    "\n",
    "PCA is most commonly used as a pre-processing step.  Statistically, many models assume data to be low-dimensional.  In those cases, the output of PCA will actually include much less of the noise and subsequent models can be more accurate.  Taking datasets with a huge number of features and reducing them down can be shown to not hurt the accuracy of the clustering while enjoying significantly improved performance.  In addition, using PCA in advance of a linear model can make overfitting due to multi-collinearity less likely.\n",
    "\n",
    "For our current use case though, we focus purely on the output of PCA.  [Eigenfaces](https://en.wikipedia.org/wiki/Eigenface) have been used for years in facial recognition and computer vision.  The eerie images represent a large library of photos as a smaller subset.  These eigenfaces are not necessarily clusters, but instead highlight key features, that when combined, can represent most of the variation in faces throughout the entire library.  We'll follow an analagous path and develop eigendigits from our handwritten digit dataset.\n",
    "\n",
    "To get started, we need to set up the environment with a few prerequisite steps, for permissions, configurations, and so on."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Prequisites and Preprocessing\n",
    "\n",
    "### Permissions and environment variables\n",
    "\n",
    "_This notebook was created and tested on an ml.m4.xlarge notebook instance._\n",
    "\n",
    "Let's start by specifying:\n",
    "\n",
    "- The S3 bucket and prefix that you want to use for training and model data.  This should be within the same region as the Notebook Instance, training, and hosting.\n",
    "- The IAM role arn used to give training and hosting access to your data. See the documentation for how to create these.  Note, if more than one role is required for notebook instances, training, and/or hosting, please replace the boto regexp with a the appropriate full IAM role arn string(s)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true,
    "isConfigCell": true
   },
   "outputs": [],
   "source": [
    "bucket = '<your_s3_bucket_name_here>'\n",
    "prefix = 'sagemaker/DEMO-pca-mnist'\n",
    " \n",
    "# Define IAM role\n",
    "import boto3\n",
    "import re\n",
    "from sagemaker import get_execution_role\n",
    "\n",
    "role = get_execution_role()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Data ingestion\n",
    "\n",
    "Next, we read the dataset from an online URL into memory, for preprocessing prior to training. This processing could be done *in-situ* by Amazon Athena, Apache Spark in Amazon EMR, Amazon Redshift, etc., assuming the dataset is present at the appropriate location. Then, the next step would be to transfer the data to S3 for use in training. For small datasets such as this one, reading into memory isn't onerous, though it would be for larger datasets."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "%%time\n",
    "import pickle, gzip, numpy, urllib.request, json\n",
    "\n",
    "# Load the dataset\n",
    "urllib.request.urlretrieve(\"http://deeplearning.net/data/mnist/mnist.pkl.gz\", \"mnist.pkl.gz\")\n",
    "with gzip.open('mnist.pkl.gz', 'rb') as f:\n",
    "    train_set, valid_set, test_set = pickle.load(f, encoding='latin1')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Data inspection\n",
    "\n",
    "Once the dataset is imported, it's typical as part of the machine learning process to inspect the data, understand the distributions, and determine what type(s) of preprocessing might be needed. You can perform those tasks right here in the notebook. As an example, let's go ahead and look at one of the digits that is part of the dataset."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "%matplotlib inline\n",
    "import matplotlib.pyplot as plt\n",
    "plt.rcParams[\"figure.figsize\"] = (2,10)\n",
    "\n",
    "\n",
    "def show_digit(img, caption='', subplot=None):\n",
    "    if subplot==None:\n",
    "        _,(subplot)=plt.subplots(1,1)\n",
    "    imgr=img.reshape((28,28))\n",
    "    subplot.axis('off')\n",
    "    subplot.imshow(imgr, cmap='gray')\n",
    "    plt.title(caption)\n",
    "\n",
    "show_digit(train_set[0][30], 'This is a {}'.format(train_set[1][30]))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Data conversion\n",
    "\n",
    "Since algorithms have particular input and output requirements, converting the dataset is also part of the process that a data scientist goes through prior to initiating training. In this particular case, the Amazon SageMaker implementation of PCA takes recordIO-wrapped protobuf, where the data we have today is a pickle-ized numpy array on disk.\n",
    "\n",
    "Most of the conversion effort is handled by the Amazon SageMaker Python SDK, imported as `sagemaker` below."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "import io\n",
    "import numpy as np\n",
    "import sagemaker.amazon.common as smac\n",
    "\n",
    "vectors = np.array([t.tolist() for t in train_set[0]]).T\n",
    "\n",
    "buf = io.BytesIO()\n",
    "smac.write_numpy_to_dense_tensor(buf, vectors)\n",
    "buf.seek(0)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Upload training data\n",
    "Now that we've created our recordIO-wrapped protobuf, we'll need to upload it to S3, so that Amazon SageMaker training can use it."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true,
    "scrolled": true
   },
   "outputs": [],
   "source": [
    "%%time\n",
    "import boto3\n",
    "import os\n",
    "\n",
    "key = 'recordio-pb-data'\n",
    "boto3.resource('s3').Bucket(bucket).Object(os.path.join(prefix, 'train', key)).upload_fileobj(buf)\n",
    "s3_train_data = 's3://{}/{}/train/{}'.format(bucket, prefix, key)\n",
    "print('uploaded training data location: {}'.format(s3_train_data))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Let's also setup an output S3 location for the model artifact that will be output as the result of training with the algorithm."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "output_location = 's3://{}/{}/output'.format(bucket, prefix)\n",
    "print('training artifacts will be uploaded to: {}'.format(output_location))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Training the PCA model\n",
    "\n",
    "Once we have the data preprocessed and available in the correct format for training, the next step is to actually train the model using the data. Since this data is relatively small, it isn't meant to show off the performance of the PCA training algorithm, although we have tested it on multi-terabyte datasets.\n",
    "\n",
    "Again, we'll use the Amazon SageMaker Python SDK to kick off training, and monitor status until it is completed.  In this example that takes between 7 and 11 minutes.  Despite the dataset being small, provisioning hardware and loading the algorithm container take time upfront.\n",
    "\n",
    "First, let's specify our containers.  Since we want this notebook to run in all 4 of Amazon SageMaker's regions, we'll create a small lookup.  More details on algorithm containers can be found in [AWS documentation](https://docs-aws.amazon.com/sagemaker/latest/dg/sagemaker-algo-docker-registry-paths.html)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "from sagemaker.amazon.amazon_estimator import get_image_uri\n",
    "container = get_image_uri(boto3.Session().region_name, 'pca')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Next we'll kick off the base estimator, making sure to pass in the necessary hyperparameters.  Notice:\n",
    "- `feature_dim` is set to 50000.  We've transposed our datasets relative to most of the other MNIST examples because for eigendigits we're looking to understand pixel relationships, rather than make predictions about individual images.\n",
    "- `num_components` has been set to 10.  This could easily be increased for future experimentation.  In practical settings, setting the number of components typically uses a mixture of objective and subjective criteria.  Data Scientists tend to look for the fewest principal components that eat up the most variation in the data.\n",
    "- `subtract_mean` standardizes the pixel intensity across all images.  The MNIST data has already been extensively cleaned, but including this shouldn't hurt.\n",
    "- `algorithm_mode` is set to 'randomized'.  Because we have a very large number of dimensions, this makes the most sense.  The alternative 'stable' should be used in cases with a lower value for `feature_dim`.\n",
    "- `mini_batch_size` has been set to 200.  For PCA, this parameter should not affect fit, but may have slight implications on timing.  Other algorithms may require tuning of this parameter in order to achieve the best results."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "import boto3\n",
    "import sagemaker\n",
    "\n",
    "sess = sagemaker.Session()\n",
    "\n",
    "pca = sagemaker.estimator.Estimator(container,\n",
    "                                    role, \n",
    "                                    train_instance_count=1, \n",
    "                                    train_instance_type='ml.c4.xlarge',\n",
    "                                    output_path=output_location,\n",
    "                                    sagemaker_session=sess)\n",
    "pca.set_hyperparameters(feature_dim=50000,\n",
    "                        num_components=10,\n",
    "                        subtract_mean=True,\n",
    "                        algorithm_mode='randomized',\n",
    "                        mini_batch_size=200)\n",
    "\n",
    "pca.fit({'train': s3_train_data})"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Set up hosting for the model\n",
    "Now that we've trained our model, we can deploy it behind an Amazon SageMaker real-time hosted endpoint.  This will allow out to make predictions (or inference) from the model dyanamically.\n",
    "\n",
    "_Note, Amazon SageMaker allows you the flexibility of importing models trained elsewhere, as well as the choice of not importing models if the target of model creation is AWS Lambda, AWS Greengrass, Amazon Redshift, Amazon Athena, or other deployment target._"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "pca_predictor = pca.deploy(initial_instance_count=1,\n",
    "                           instance_type='ml.m4.xlarge')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Validate the model for use\n",
    "Finally, we can now validate the model for use.  We can pass HTTP POST requests to the endpoint to get back predictions.  To make this easier, we'll again use the Amazon SageMaker Python SDK and specify how to serialize requests and deserialize responses that are specific to the algorithm."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "from sagemaker.predictor import csv_serializer, json_deserializer\n",
    "\n",
    "pca_predictor.content_type = 'text/csv'\n",
    "pca_predictor.serializer = csv_serializer\n",
    "pca_predictor.deserializer = json_deserializer"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Now let's try getting a prediction for a single record."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "result = pca_predictor.predict(train_set[0][:, 0])\n",
    "print(result)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "OK, a single prediction works.  We see that for one record our endpoint returned some JSON which contains a value for each of the 10 principal components we created when training the model.\n",
    "\n",
    "Let's do a whole batch and see what comes out."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "import numpy as np\n",
    "\n",
    "eigendigits = []\n",
    "for array in np.array_split(train_set[0].T, 100):\n",
    "    result = pca_predictor.predict(array)\n",
    "    eigendigits += [r['projection'] for r in result['projections']]\n",
    "\n",
    "\n",
    "eigendigits = np.array(eigendigits).T"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "for e in enumerate(eigendigits):\n",
    "    show_digit(e[1], 'eigendigit #{}'.format(e[0]))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Not surprisingly, the eigendigits aren't extremely interpretable.  They do show interesting elements of the data, with eigendigit #0 being the \"anti-number\", eigendigit #1 looking a bit like a `0` combined with the inverse of a `3`, eigendigit #2 showing some shapes resembling a `9`, and so on."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### (Optional) Delete the Endpoint\n",
    "\n",
    "If you're ready to be done with this notebook, please run the delete_endpoint line in the cell below.  This will remove the hosted endpoint you created and avoid any charges from a stray instance being left on."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "import sagemaker\n",
    "\n",
    "sagemaker.Session().delete_endpoint(pca_predictor.endpoint)"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Environment (conda_python3)",
   "language": "python",
   "name": "conda_python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.6.3"
  },
  "notice": "Copyright 2017 Amazon.com, Inc. or its affiliates. All Rights Reserved.  Licensed under the Apache License, Version 2.0 (the \"License\"). You may not use this file except in compliance with the License. A copy of the License is located at http://aws.amazon.com/apache2.0/ or in the \"license\" file accompanying this file. This file is distributed on an \"AS IS\" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License."
 },
 "nbformat": 4,
 "nbformat_minor": 2
}
